Abstract

The Probability Ranking Principle states that the document set 
with the highest values of probability of relevance optimizes infor- 
mation retrieval effectiveness given the probabilities are estimated as 
accurately as possible. The key point of the principle is the separation 
of the document set into two subsets with a given level of fallout and 
with the highest recall. The paper introduces the separation between 
two vector subspaces and shows that the separation yields a more 
effective performance than the optimal separation into subsets with 
the same available evidence, the performance being measured with re- 
call and fallout. The result is proved mathematically and exemplified 
experimentally. 

1 Introduction 

Information Retrieval (IR) systems decide about the relevance of documents.
The decision about relevance is subject to uncertainty. A probability theory
provides a measure of uncertainty. To this end, a probability theory defines
the event space and the probability distribution. The research in
probabilistic IR is based on the classical theory, which describes events and
probability distributions using, respectively, sets and set measures,
according to Kolmogorov's axioms [1].

Ranking is perhaps the most crucial IR task and the Probability Ranking 
Principle (PRP) reported in [2] is by far the most important theoretical 
result to date. Although IR systems reach good results thanks to (classical) 
probability theory and parameter tuning, ranking is far from being perfect 
because useless units are often ranked at the top of the retrieved document
list, or useful units are missing from it.
The paper investigates whether an alternative probability theory may
achieve further improvement. We propose Vector Probability to describe 
events and probability distributions using vectors, matrices and operators on 
them. The adoption of Vector Probability means a radical change: Vector 
Probability is based on vector subspaces whereas classical probability is based 
on sets such that the regions of acceptance and rejection of a hypothesis 
system are sets. We express Vector Probability by means of the mathematical
apparatus used in Quantum Theory. However, the use of the mathematical
apparatus of Quantum Theory does not end in an investigation or assertion
of quantum phenomena in IR. Rather, we argue the superiority of Vector
Probability for ranking documents over the classical probability theory. Any
reflection on Quantum Theory and IR is beyond the scope of the paper.

The paper shows that ranking in accordance with Vector Probability is 
more effective than ranking by classical probability given that the same ev- 
idence is available for probability estimation. The effectiveness is measured 
in terms of probability of correct decision or, equivalently, of probability of 
error. We propose a decision rule based on vector subspaces such that, in 
the long run, the IR system will deem a document relevant correctly at a 
higher recall than the recall measured in the event of ranking as a result of 
classical probability when the fallout is not more than a given threshold. So, 
the decision rule minimizes the probability of error. 

We organize the paper as follows. Section [2] provides the definitions used
in the paper: this section can be skipped if the reader is familiar with
Quantum Theory and Probability; further definitions can be found in [3].
Section [3] introduces the aspects of the probability of relevance related to
the subsequent sections. Section [4] explains Vector Probability. Section [5]
describes an instance of the vector probability of relevance when the Poisson
distribution is used for an observable of a document. Section [6] introduces
the optimal observable vectors. Section [7] shows that ranking by means of
the optimal observable vectors is more effective than ranking by means of the
best subsets of observed values. Section [8] describes an experimental study
that confirms the theory. Section [9] is about the measurement of observable
vectors and makes some remarks about the actual use. Section [10] refers to
the main related publications and Section [11] concludes the paper.
Definition                   IR Example or Corresponding Concept
---------------------------  ---------------------------------------------------------
Observable                   Term frequency, relevance, color, aboutness
Probability Distribution     Distribution of probability of term frequency
State                        Relevance, aboutness, utility
Probability of Detection     Recall
Probability of False Alarm   Precision
Region of Acceptance         Term frequencies that are higher than a threshold
Observable Vector            Term frequency, relevance, color, aboutness
State Vector                 Distribution of Relevance, aboutness, utility probability

Table 1: Definitions, IR examples and concepts.



2 Definitions 

We report some definitions and compare them to IR concepts in Table [1].

Definition 1 (Observable) An observable is a property that can be mea- 
sured from an entity. 

Definition 2 (Probability Distribution) A probability distribution is a 
function that maps observable values to the real interval [0,1]. As usual, the
probabilities sum to 1.

Definition 3 (Classical Probability Distribution) A classical probabil- 
ity distribution admits only sets of observable values. 

The subsets of observable values can be defined by means of the set operations 
(i.e., intersection, union, complement). 

Definition 4 (State) A state, or hypothesis, is a condition of the measured 
entity and molds the probability distribution of the measurement. 

In classical probability, "hypothesis" is more common than "state". We use
"state" because it is used in the formalism of Quantum Theory. We let the
null state correspond to non-relevance and the alternative state to relevance.
An IR system decides between the relevance state and the non-relevance state
given an observed value.

Definition 5 (Probability of Detection) It is the probability that the 
system decides for relevance when relevance is true; it is also called power. 
Definition 6 (Probability of False Alarm) It is the probability that the
system decides for relevance when relevance is false; it is also called size or 
level. 

As the probability of detection and the probability of false alarm cannot 
be simultaneously optimized, the decision rules maximize the probability of 
detection when the probability of false alarm is fixed. 

Definition 7 (Region of Acceptance) A region of acceptance consists of 
the observable values that induce the system to decide for relevance. The most 
powerful region of acceptance yields the maximum power for a fixed size. 

Note that "acceptance" often refers to the null state in Statistics, whereas
here it refers to the decision for relevance.

The Neyman-Pearson lemma states that the maximum likelihood (ML)
ratio test defines the most powerful region of acceptance [3].

Definition 8 (Probability of Correct Decision)

$$P_c = \xi (1 - P_0) + (1 - \xi) P_d \tag{1}$$

provided that $\xi$ is the prior probability of the null state, $P_0$ is the
probability of false alarm and $P_d$ is the probability of detection.

Definition 9 (Probability of Error)

$$P_e = \xi P_0 + (1 - \xi)(1 - P_d) \tag{2}$$

From (1) and (2), $P_e + P_c = 1$.

Definition 10 (Vector Space) A vector space over a field $\mathcal{F}$ is a set of
vectors subject to linearity, namely, a set such that, for every vector $|u\rangle$,
there are three scalars $a, b, c \in \mathcal{F}$ and three vectors $|v\rangle, |x\rangle, |y\rangle$ of the same
space such that $|u\rangle = a|v\rangle$ and $|u\rangle = b|x\rangle + c|y\rangle$. If $|u\rangle$ is a vector, $\langle u|$ is its
transpose, $\langle v|u\rangle$ is the inner product with $|v\rangle$ and $|v\rangle\langle u|$ is the outer product
with $|v\rangle$. If $|\langle x|x\rangle|^2 = 1$, the vector is normal. If $\langle x|y\rangle = 0$, the vectors are
mutually orthogonal.

We adopt the Dirac notation to write vectors so that the reader may refer to
the literature on Quantum Theory; a brief illustration of the Dirac notation
is in [3].
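Equations (1) and (2) of Definitions 8 and 9 can be checked numerically. The following is a minimal sketch: the function names and the values of the prior, the fallout and the recall are ours and purely illustrative.

```python
def prob_correct(xi: float, p0: float, pd: float) -> float:
    """Equation (1): P_c = xi * (1 - P_0) + (1 - xi) * P_d."""
    return xi * (1.0 - p0) + (1.0 - xi) * pd


def prob_error(xi: float, p0: float, pd: float) -> float:
    """Equation (2): P_e = xi * P_0 + (1 - xi) * (1 - P_d)."""
    return xi * p0 + (1.0 - xi) * (1.0 - pd)


# Hypothetical values: prior of non-relevance, fallout, recall.
xi, p0, pd = 0.5, 0.2, 0.7
print(prob_correct(xi, p0, pd), prob_error(xi, p0, pd))
```

For any admissible input the two values sum to one, which is the identity stated after Definition 9.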
Definition 11 (Observable and Observable Vector) An observable is
a collection of values and of vectors. The observable vectors are mutually
orthonormal and in 1:1 correspondence with the values.

An observable corresponds to a random variable in Statistics whereas the 
observable vectors correspond to the indicator functions. 

Definition 12 (State Vector) A state vector defines a vector probability 
distribution. The possible states (or hypotheses) correspond to state vectors. 

A state vector plays the role of parameters in Statistics. 

Definition 13 (Vector Probability) The vector probability that x is
observed given the state m is $|\langle x|m\rangle|^2$.

Vector probability is axiomatically defined in [5] and is applied to IR in [3];
the generalization to state matrices or density matrices is not necessary in
this paper.

3 Probability of Relevance 

Suppose that a document is represented through the random variable X such 
that X = x means that a term has frequency x. The decision is between 
relevance and non-relevance. Thus, the probability of detection is the
probability that the observed frequency belongs to the region of acceptance when
a document is relevant and the probability of false alarm is the probability 
that the observed frequency belongs to the region of acceptance when the 
document is not relevant. 

The very general form of Probability Ranking Principle (PRP) and then 
the BM25 reported in [HI page 340] are based on the ML ratio test. The PRP 
states that, if a cut-off is defined for the fallout (i.e., the probability of false 
alarm), we would clearly optimize (i.e., maximize recall, namely, probability 
of detection, or equivalently, precision) if we included in the retrieved set 
those documents with the highest probability of relevance [2, page 297], which
is the probability that X = x when the document is relevant. 

Therefore, the PRP and the Neyman-Pearson lemma state that, given a 
region of acceptance and then a probability of false alarm (i.e., fallout), the 
document ranking as a result of probability of relevance is optimal because 
the recall is maximum. 
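The construction of a region of acceptance by the ML ratio test can be sketched as follows. This is a simplified, non-randomized variant written for illustration (in general the Neyman-Pearson lemma may require randomization to attain a given size exactly); the function name and the two distributions are hypothetical.

```python
def ml_ratio_region(p0, p1, alpha):
    """Greedy, non-randomized likelihood-ratio region.

    p0[x], p1[x]: probability of value x under non-relevance / relevance.
    alpha: maximum tolerated probability of false alarm (fallout).
    Returns (region, fallout, recall).
    """
    def ratio(x):
        # Values that are impossible under non-relevance get an infinite ratio.
        return float("inf") if p0[x] == 0 else p1[x] / p0[x]

    region, fallout, recall = [], 0.0, 0.0
    for x in sorted(range(len(p0)), key=ratio, reverse=True):
        if fallout + p0[x] > alpha + 1e-12:
            break  # adding x would exceed the size of the test
        region.append(x)
        fallout += p0[x]
        recall += p1[x]
    return sorted(region), fallout, recall


# Hypothetical term-frequency distributions over x = 0, 1, 2, 3.
p_nonrel = [0.6, 0.2, 0.1, 0.1]
p_rel = [0.1, 0.2, 0.3, 0.4]
print(ml_ratio_region(p_nonrel, p_rel, alpha=0.2))
```

Values are added in decreasing order of likelihood ratio until the fallout budget is used up, so the recall obtained is the largest that a non-randomized region of that size can reach for these distributions.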
4 Vector Probability of Relevance



Suppose that X is an observable (e.g., term frequency) and x a value of the
set {0, 1, ..., N}. The orthonormal observable vectors that correspond to the
values are $|0\rangle, |1\rangle, \ldots, |N\rangle$; the actual implementation of these vectors is not
essential. An observable vector $|x\rangle$ corresponds to x and

$$X = \sum_{x=0}^{N} x\,|x\rangle\langle x| \tag{3}$$

Suppose that $p(x; m)$ is the probability that X = x given a parameter
m. In the event of binary relevance, m is either $m_0$ (non-relevance) or $m_1$
(relevance). Note that m may refer to more than one parameter. However,
we assume that m is scalar for the sake of clarity.

Two relevance state vectors represent binary relevance: a relevance state
vector $|m_0\rangle$ represents the non-relevance state and an orthogonal relevance state
vector $|m_1\rangle$ represents the relevance state. Relevance state vectors and the
observable vectors belong to a finite-dimensional vector space. (In Quantum
Theory, the vector spaces are complex Hilbert spaces; for the sake of
simplicity, we do not consider the field.) Thus, either state vector can be
defined in terms of a given orthonormal basis of that space and, in
particular, the observable vectors are a basis. The following expressions

$$|m_0\rangle = \sum_{x=0}^{N} a(x; m_0)\,|x\rangle \qquad |m_1\rangle = \sum_{x=0}^{N} a(x; m_1)\,|x\rangle \qquad a(x; m) = \pm\sqrt{p(x; m)} \tag{4}$$

establish the relationship between parametric distributions and vector
spaces, namely, between the parameters $m_0, m_1$, the relevance state vectors
$|m_0\rangle, |m_1\rangle$ and the observable X. The sign of $a(x; m)$ is chosen so that the
orthogonality between the state vectors is retained. Equations (4) are
instances of superposition. In Physics, superposition models observables that
are known only if they are measured. In IR, the event that an observable
exists only if observed is a much debated hypothesis. Moreover, the
orthogonality of the relevance state vectors and the following expression



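\begin{align}
|\langle y|m\rangle|^2 &= \left|\sum_{x=0}^{N} a(x; m)\,\langle y|x\rangle\right|^2 \tag{5} \\
&= |a(y; m)|^2 \quad \text{(due to orthogonality)} \tag{6} \\
&= p(y; m) \tag{7}
\end{align}

establish the relationship between probability distribution and vector-based
representation of relevance.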
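The chain (5)-(7) can be verified on a small case. The sketch below uses plain Python lists as real vectors; the helper names and the distribution are ours, and the choice of signs is arbitrary because it does not affect the squared projections.

```python
import math


def state_vector(p, signs=None):
    """Amplitudes a(x; m) = +/- sqrt(p(x; m)) as in Equation (4)."""
    signs = signs or [1] * len(p)
    return [s * math.sqrt(px) for s, px in zip(signs, p)]


def basis_vector(y, n):
    """Observable vector |y> as a one-hot list of length n."""
    return [1.0 if x == y else 0.0 for x in range(n)]


def inner(u, v):
    """Inner product <u|v> of two real vectors."""
    return sum(a * b for a, b in zip(u, v))


p = [0.5, 0.3, 0.2]                      # hypothetical p(x; m)
m = state_vector(p, signs=[1, -1, 1])    # hypothetical choice of signs
recovered = [inner(basis_vector(y, len(p)), m) ** 2 for y in range(len(p))]
print(recovered)
```

Up to floating-point rounding, the squared projections reproduce the original probabilities, and the state vector has unit length because the probabilities sum to one.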



5 Poisson-Based Probability of Relevance 

The Poisson distribution is used because we want to make the illustration 
of the theory accessible in the remainder of the section and consistent with 
the past literature, e.g., [H EJ EJ [10]. Moreover, the Poisson distribution is 
asymptotically derived from the Binomial distribution and approximates the 
Normal distribution. 

The observable X is the frequency of a term in a document. Thus, X = x
means that the term occurs x times in a document. The Poisson probability
distribution gives the probability that X = x, that is,

$$p(x; m) = e^{-m}\,\frac{m^x}{x!} \tag{8}$$

provided that m is the expected term frequency. X is defined on the set of
natural numbers. However, we assume that N is finite, large and equal to the
maximum observable term frequency in the collection; indeed, the estimated
probability that a term frequency is greater than N is zero.

Two distinct parameters $m_0, m_1$ encode non-relevance and relevance,
respectively, in parametric Statistics. Thus,

$$p(x; m_0) \tag{9}$$

is the probability that the term occurs x times in a non-relevant document
and

$$p(x; m_1) \tag{10}$$

is the probability that the term occurs x times in a relevant document.
For example, consider the probability distributions in Table [5]. We have
that

$$X = 0\,|0\rangle\langle 0| + 1\,|1\rangle\langle 1| = |1\rangle\langle 1|$$

$$|m_0\rangle = a(0; m_0)\,|0\rangle + a(1; m_0)\,|1\rangle \qquad |m_1\rangle = a(0; m_1)\,|0\rangle + a(1; m_1)\,|1\rangle$$

$$p(0; m_0) = |a(0; m_0)|^2 \qquad p(0; m_1) = |a(0; m_1)|^2$$

The Poisson-based probabilities of detection and false alarm are,
respectively,

$$P_d = \sum_{x=x_\alpha}^{N} p(x; m_1) \qquad P_0 = \sum_{x=x_\alpha}^{N} p(x; m_0) \tag{11}$$

assuming that N is so large that $p(x; m) = 0$ for $x > N$, and $\{x_\alpha, \ldots, N\}$ is the
region of acceptance of the state of relevance at size $\alpha$.
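Equations (8) and (11) translate directly into code. In this minimal sketch the means, the threshold and the maximum frequency are hypothetical values chosen only to show the computation.

```python
import math


def poisson(x: int, m: float) -> float:
    """Equation (8): p(x; m) = exp(-m) * m**x / x!."""
    return math.exp(-m) * m ** x / math.factorial(x)


def tail(x_alpha: int, m: float, n_max: int) -> float:
    """Probability that x_alpha <= X <= n_max when X is Poisson with mean m."""
    return sum(poisson(x, m) for x in range(x_alpha, n_max + 1))


m0, m1 = 1.0, 4.0         # hypothetical means for non-relevance and relevance
x_alpha, n_max = 3, 60    # hypothetical threshold and maximum frequency N

p_fa = tail(x_alpha, m0, n_max)    # P_0, the fallout
p_det = tail(x_alpha, m1, n_max)   # P_d, the recall
print(round(p_fa, 4), round(p_det, 4))
```

Lowering `x_alpha` enlarges the region of acceptance, which raises both the recall and the fallout.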

When the states are equiprobable and $m_1 > m_0$, the Poisson-based
probability of error and the Poisson-based probability of correct decision are

\begin{align}
P_e &= \frac{1}{2}\left(\sum_{x=x_\alpha}^{N} p(x; m_0) + \sum_{x=0}^{x_\alpha-1} p(x; m_1)\right) \tag{12} \\
&= \frac{1}{2}\left(1 - \int_{m_0}^{m_1} \frac{t^{x_\alpha-1} e^{-t}}{\Gamma(x_\alpha)}\,dt\right) \tag{13} \\
P_c &= \frac{1}{2}\left(\sum_{x=x_\alpha}^{N} p(x; m_1) + \sum_{x=0}^{x_\alpha-1} p(x; m_0)\right) \tag{14} \\
&= \frac{1}{2}\left(1 + \int_{m_0}^{m_1} \frac{t^{x_\alpha-1} e^{-t}}{\Gamma(x_\alpha)}\,dt\right) \tag{15}
\end{align}

The greater the difference between $m_0$ and $m_1$, the greater $P_c$ and the smaller
$P_e$. If $m_1 < m_0$, the upper and lower limits of the integral derived from (12)
swap. If $m_1 = m_0$, error and correct decision are equiprobable, i.e.,
the decision is ruled through coin tossing. In other words, the discrimination
power increases when $|m_1 - m_0|$ increases. Note that the increase of $|m_1 - m_0|$
corresponds to making the relevance state vectors orthogonal.

Figure 1: Polygonal curve that minimizes the probability of error. The
polygonal curve is obtained from the lowest segments connected at the
intersection points.
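The step from the sums to the integral in the Poisson-based probability of error rests on the relationship between the Poisson distribution and the incomplete Gamma function. The following sketch checks it numerically with a composite Simpson rule; the parameters are hypothetical, and the tail beyond N is neglected, as the text assumes.

```python
import math


def poisson_cdf(k: int, m: float) -> float:
    """Sum of the Poisson probabilities p(x; m) over x = 0, ..., k - 1."""
    return sum(math.exp(-m) * m ** x / math.factorial(x) for x in range(k))


def gamma_integral(k: int, lo: float, hi: float, steps: int = 2000) -> float:
    """Simpson estimate of the integral of t^(k-1) e^(-t) / Gamma(k) on [lo, hi]."""
    def f(t):
        return t ** (k - 1) * math.exp(-t) / math.gamma(k)

    h = (hi - lo) / steps          # steps must be even for Simpson's rule
    total = f(lo) + f(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3


m0, m1, x_alpha = 1.0, 4.0, 3      # hypothetical parameters with m1 > m0
integral = gamma_integral(x_alpha, m0, m1)
# Error from the sums: false alarms under m0 plus misses under m1.
p_e = 0.5 * ((1 - poisson_cdf(x_alpha, m0)) + poisson_cdf(x_alpha, m1))
print(p_e, 0.5 * (1 - integral))
```

The two printed numbers agree to within the accuracy of the numerical integration, which is what the integral form of the probability of error asserts.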

Probability of error and probability of correct decision provide an
alternative form of the decision procedure. From Equations (1) and (2), the
maximum $P_d$ and the minimum $P_0$ minimize $P_e$ and maximize $P_c$. Suppose we
have three size values $\alpha_0 < \alpha_1 < \alpha_2$, thus yielding three power values
$\beta_0 < \beta_1 < \beta_2$. The minimum probability of error depends on the prior
probability as follows:

$$\min P_e = \begin{cases} \xi\alpha_0 + (1-\xi)(1-\beta_0) & 0 \le \xi < \xi_1 \\ \xi\alpha_1 + (1-\xi)(1-\beta_1) & \xi_1 \le \xi < \xi_2 \\ \xi\alpha_2 + (1-\xi)(1-\beta_2) & \xi_2 \le \xi \le 1 \end{cases} \tag{16}$$

such that consecutive segments intersect at

$$\xi_i = \frac{\beta_i - \beta_{i-1}}{\alpha_i - \alpha_{i-1} + \beta_i - \beta_{i-1}} \qquad i = 1, 2$$

Figure [1] shows an example of the polygonal curve for three pairs of size and
power values; the abscissas of the intersection points are $\xi_1, \xi_2$.
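The polygonal curve can be traced by taking, for each prior probability, the lowest of the straight lines given by Equation (2). The sketch below does this for hypothetical pairs of size and power; these are not the values used for Figure 1.

```python
def error_line(xi, alpha, beta):
    """Probability of error of Equation (2) for size alpha and power beta."""
    return xi * alpha + (1.0 - xi) * (1.0 - beta)


def min_error(xi, tests):
    """Lowest probability of error over the available (alpha, beta) pairs."""
    return min(error_line(xi, a, b) for a, b in tests)


# Hypothetical (size, power) pairs.
tests = [(0.1, 0.5), (0.3, 0.8), (0.6, 0.95)]
curve = [(i / 10, round(min_error(i / 10, tests), 3)) for i in range(11)]
print(curve)
```

Because the curve is the pointwise minimum of straight lines, it is piecewise linear and concave, and its corners are the intersection points of the segments.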

6 Optimal Observable Vectors 

Let us recapitulate some facts. Neyman-Pearson's lemma states that the set 
of term frequencies can be partitioned into two disjoint regions: one region 
includes all the frequencies such that relevance will be accepted; the other 
region denotes rejection. If a term is observed from documents and only
presence/absence is observed, the set of the observable values is {0, 1} and
each region is one of the possible subsets, i.e., $\emptyset$, {0}, {1}, {0, 1}.

If term frequency is observed instead, the observable values are
{0, 1, ..., N} and each region is one of the possible subsets of {0, 1, ..., N}.
Note that the ML ratio test yields two subsets, i.e., $\{0, \ldots, x_\alpha - 1\}$ and
$\{x_\alpha, \ldots, N\}$. An alternative region can be defined with only set operations
(intersection, complement, union). However, set operations cannot define
more powerful regions than that dictated by the Neyman-Pearson lemma.

The subspaces are equivalent to subsets and they then can be subject to
set operations if they are mutually orthogonal [11]. The subsets yielded
by dint of the ML ratio test become $\{|0\rangle\langle 0|, \ldots, |x_\alpha - 1\rangle\langle x_\alpha - 1|\}$ and
$\{|x_\alpha\rangle\langle x_\alpha|, \ldots, |N\rangle\langle N|\}$ and can be subjected to set operations because they
are mutually orthogonal.

Suppose that the vector subspaces that correspond to the subsets yielded 
by dint of the ML ratio test, are rotated so that the new subspaces are oblique 
to them. The new oblique subspaces cannot be reformulated in terms of the 
observable vectors through set operations and thus they represent something 
different. 

Figure [2] shows a three-dimensional vector space spanned by $|e_1\rangle, |e_2\rangle, |e_3\rangle$
to illustrate the difference between subsets and subspaces. The ray (i.e.,
one-dimensional subspace) $L_x$ is spanned by $|x\rangle$, the plane (i.e.,
two-dimensional subspace) $L_{x,y}$ is spanned by $|x\rangle, |y\rangle$. Note that
$L_{e_1,e_2} = L_{x,y} = L_{e_1,y}$ and so on. According to [12, page 191], consider the
subspace $L_{e_2} \wedge (L_y \vee L_x)$ provided that $\wedge$ means "intersection" and $\vee$ means
"span" (and not set union). Since $L_y \vee L_x = L_{x,y} = L_{e_1,e_2}$,
$L_{e_2} \wedge (L_y \vee L_x) = L_{e_2} \wedge L_{e_1,e_2} = L_{e_2}$. However,
$(L_{e_2} \wedge L_y) \vee (L_{e_2} \wedge L_x) = \emptyset$ because $L_{e_2} \wedge L_y = \emptyset$ and $L_{e_2} \wedge L_x = \emptyset$, therefore

$$L_{e_2} \wedge (L_y \vee L_x) \neq (L_{e_2} \wedge L_y) \vee (L_{e_2} \wedge L_x)$$

thus meaning that the distributive law does not hold; hence, set operations
cannot be applied to subspaces.

Figure 2: The difference between subsets and subspaces.

The key point is that, if the subspaces are rotated in an optimal way, we 
can achieve the most powerful regions; these regions cannot be ascribed to 
the subsets yielded by dint of the ML ratio test. The following theorem, due
to Helstrom, is the rule to compute the most powerful regions of a vector
space given two state vectors.

Theorem 1 Let $|m_1\rangle, |m_0\rangle$ be the state vectors. The region of acceptance at
the highest probability of detection at every probability of false alarm is given
by the eigenvectors of

$$|m_1\rangle\langle m_1| - |m_0\rangle\langle m_0| \tag{17}$$

whose eigenvalues are positive.

Proof See [13]. (The $|m_j\rangle\langle m_j|$ are defined in Section [2].) ■
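For two real, two-dimensional state vectors the eigenvector of (17) with positive eigenvalue can be written in closed form. The sketch below is an illustration under that restriction; the state vectors are hypothetical and the helper names are ours.

```python
import math


def outer_diff(m1, m0):
    """Matrix |m1><m1| - |m0><m0| for real two-dimensional vectors."""
    return [[m1[i] * m1[j] - m0[i] * m0[j] for j in range(2)] for i in range(2)]


def positive_eigenvector(a):
    """Largest eigenvalue of a symmetric 2x2 matrix and its unit eigenvector."""
    p, q, r = a[0][0], a[0][1], a[1][1]
    lam = (p + r) / 2 + math.sqrt(((p - r) / 2) ** 2 + q ** 2)
    if abs(q) > 1e-12:
        v = (q, lam - p)          # solves (A - lam I) v = 0 when q != 0
    else:
        v = (1.0, 0.0) if p >= r else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    return lam, (v[0] / norm, v[1] / norm)


# Hypothetical state vectors expressed in the basis |0>, |1>.
m0 = (math.sqrt(0.8), math.sqrt(0.2))   # non-relevance
m1 = (0.0, 1.0)                         # relevance
lam, mu1 = positive_eigenvector(outer_diff(m1, m0))

q_d = (m1[0] * mu1[0] + m1[1] * mu1[1]) ** 2   # probability of detection
q_0 = (m0[0] * mu1[0] + m0[1] * mu1[1]) ** 2   # probability of false alarm
print(lam, mu1, q_d, q_0)
```

The matrix has zero trace because both state vectors are unit vectors, so exactly one eigenvalue is positive whenever the two states differ; projecting the state vectors on its eigenvector gives the probabilities of detection and of false alarm.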
Definition 14 An optimal observable vector is a vector that divides the
region of acceptance from the region of rejection as stated by Theorem [1].

The optimal observable vectors always exist due to the Spectral Decomposi- 
tion theorem. [TJ] 

The angle between the relevance state vectors $|m_1\rangle, |m_0\rangle$ determines the
geometry of the decision between the two states. Suppose $|\mu_0\rangle, |\mu_1\rangle$ are two
observable vectors. They are mutually orthogonal because they are eigenvectors
of (17) and can be defined in the space spanned by the relevance state vectors.
The probability of correct decision and the probability of error are given by
the angle between the two relevance state vectors and by the angles between
the observable vectors and the relevance state vectors; how the geometry
defines the optimal ranking is described in the next section.



7 Optimal Probability Ranking 

Figure 3(a) illustrates the geometry of decision and the observable vectors.
(The figure is in the two-dimensional space for the sake of clarity, but the
reader should generalize to dimensionality higher than two.) The angles $\eta_0, \eta_1$
between the observable vectors and the relevance state vectors $|m_0\rangle, |m_1\rangle$ are
related to the angle $\gamma$ between $|m_0\rangle, |m_1\rangle$ because the observable vectors
are always mutually orthogonal and then $\pi/2 = \eta_0 + \gamma + \eta_1$.

The optimal observable vectors are achieved when the angles between an
observable vector and a relevance state vector are equal to

$$\theta = \frac{\pi/2 - \gamma}{2} \tag{18}$$

The rotation of the non-optimal observable vectors such that (18) holds
yields the optimal observable vectors $|\mu_1\rangle, |\mu_0\rangle$ as Figure 3(b) illustrates: the
optimal observable vectors are "symmetrically" located around the relevance
state vectors.

The replacement of the angle between an observable vector and a relevance
state vector with the angle of (18) yields the minimal probability of
error and the maximal probability of correct decision, that is,

$$Q_e = \frac{1}{2}\left(1 - \sqrt{1 - 4\xi(1-\xi)|X|^2}\right) \qquad Q_c = 1 - Q_e \tag{19}$$
(see [13]) given that

$$|X|^2 = |\langle m_0|m_1\rangle|^2 = \left|\sum_{x=0}^{N} a(x; m_0)\,a(x; m_1)\right|^2 \tag{20}$$

is the squared cosine of the angle between the relevance state vectors; the
amplitudes $a(x; m)$ of (4) are computed from (8) if the Poisson distribution is
used. Figure [4] superposes the polygonal curve plotted for $P_e$ and the
bell-shaped curves plotted for $Q_e$ with $|X|^2 = 0.90$ and $|X|^2 = 0.50$. $|X|^2$
measures the degree to which the distributions of probability of relevance and
non-relevance are indistinguishable. The less they are distinguishable, the
higher $Q_e$. Indeed, the probability of error increases when the distribution of
probability of relevance is very similar to the distribution of probability of
non-relevance.
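The behaviour of Equation (19) can be tabulated directly. This is a minimal sketch: the function name is ours, and the two overlap values are the ones mentioned for the bell-shaped curves.

```python
import math


def helstrom_error(xi: float, overlap_sq: float) -> float:
    """Equation (19): Q_e for prior xi and squared overlap |X|^2."""
    return 0.5 * (1.0 - math.sqrt(1.0 - 4.0 * xi * (1.0 - xi) * overlap_sq))


for overlap_sq in (0.50, 0.90):
    row = [round(helstrom_error(i / 4, overlap_sq), 3) for i in range(5)]
    print(overlap_sq, row)
```

The error vanishes at the extreme priors, peaks at equiprobable states and grows with the overlap, which gives the bell shape of the curves.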
We prove the following

Proposition 1 For all $m_0, m_1, x_\alpha$,

$$Q_c \ge P_c \qquad Q_e \le P_e \tag{21}$$

Proof Let $x_\alpha > 0$ and let $m_1 > m_0$; the complement is proved in a similar
way.

$$Q_c \ge P_c \quad \text{if and only if} \quad \sqrt{1 - |X|^2} \ge \int_{m_0}^{m_1} \frac{e^{-t}\,t^{x_\alpha-1}}{\Gamma(x_\alpha)}\,dt \tag{22}$$

because (1) and (2) also hold for $Q_c, Q_e$. Moreover, (22) holds if

$$1 - |X|^2 \ge \int_{m_0}^{m_1} \frac{e^{-t}\,t^{x_\alpha-1}}{\Gamma(x_\alpha)}\,dt \tag{23}$$

because the sides of (22) lie between 0 and 1. The calculation of the angle
between the relevance state vectors yields

$$|X|^2 = e^{-|m_1 - m_0|} \tag{24}$$

The relationships between the Poisson distribution and the Gamma function
allow us to state that

$$1 - e^{-|m_1 - m_0|} \ge e^{-m_0}\sum_{x=0}^{x_\alpha-1}\frac{m_0^x}{x!} - e^{-m_1}\sum_{x=0}^{x_\alpha-1}\frac{m_1^x}{x!} \tag{25}$$

We split the summations in (25), thus achieving that (25) holds if and only
if

$$\sum_{x=0}^{x_\alpha-1} e^{-m_1}\,\frac{m_1^x - m_0^x}{x!} + \sum_{x=x_\alpha}^{N}\left(e^{-m_0} - e^{-m_1}\right)\frac{m_0^x}{x!} \ge 0 \tag{26}$$

Every operand of the sum (26) is not negative, thus proving the left side
of (21). The right side is proved in a symmetric way due to (2) when applied
to $Q_e, Q_c$. ■

Proposition [1] tells us that the decision as to whether a document is relevant
is most effective when the test is a function of the optimal observable vectors
even if the Poisson-based probability is estimated as accurately as possible.
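Proposition 1 can be probed numerically for equiprobable states. In the sketch below the overlap is computed directly from the amplitudes of Equation (4), taking all signs as non-negative; this choice, the function names and the Poisson means are our assumptions for illustration.

```python
import math


def poisson(x, m):
    """Poisson probability of x occurrences when the mean is m."""
    return math.exp(-m) * m ** x / math.factorial(x)


def classical_error(m0, m1, x_alpha, n_max):
    """P_e for equiprobable states and acceptance region {x_alpha, ..., n_max}."""
    p_fa = sum(poisson(x, m0) for x in range(x_alpha, n_max + 1))
    p_miss = sum(poisson(x, m1) for x in range(x_alpha))
    return 0.5 * (p_fa + p_miss)


def vector_error(m0, m1, n_max):
    """Q_e of Equation (19) for equiprobable states, non-negative amplitudes."""
    overlap = sum(math.sqrt(poisson(x, m0) * poisson(x, m1))
                  for x in range(n_max + 1))
    return 0.5 * (1.0 - math.sqrt(max(0.0, 1.0 - overlap ** 2)))


n_max = 60
results = []
for m0, m1 in [(1.0, 2.0), (1.0, 4.0), (2.0, 8.0)]:   # hypothetical means
    q_e = vector_error(m0, m1, n_max)
    best_p_e = min(classical_error(m0, m1, k, n_max) for k in range(n_max + 1))
    results.append((q_e, best_p_e))
    print(m0, m1, round(q_e, 4), round(best_p_e, 4))
```

For each pair of means the vector-based error stays below the smallest error attainable over all thresholds, in line with the proposition.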

Corollary 1 For all $m_0, m_1, x_\alpha$ and if $Q_0 = P_0$,

$$Q_d \ge P_d \tag{27}$$

Proof Let $Q_0 = P_0$, recall (1) and the left side of (21). Thus,

$$0 \le Q_c - P_c = (1 - \xi)(Q_d - P_d) \tag{28}$$

We have that $Q_d \ge P_d$ because $Q_c \ge P_c$ and $0 < \xi < 1$. ■

Corollary [1] tells us that, once the probability of false alarm is fixed at an 
arbitrary size, the state that a document is relevant is correctly accepted 
with vector probability that is higher than any Poisson-based probability. 

The key point is that the region of acceptance induced by the optimal
observable vectors $\mu_1, \mu_0$ is more powerful than the region of acceptance of
the PRP, when the Poisson distribution measures the probability of term
frequency, all the other things being equal. A distribution different from
Poisson's or a different estimation of the probability values might reverse the
outcome of Corollary [1]. Does the power of the region of acceptance induced
through the optimal observable vectors and then Corollary [1] depend on the
probability of term frequency? In the remainder of the section, we generalize
the result to any distribution of probability of term frequency.

The probability of term frequency is given by two items: (i) the proba- 
bility values estimated for each x; (ii) the distribution used to compute the 
probability of term frequency. 

As for (i), note that the vector probability and the classical
probability of relevance are functions of the same probability distributions
$p(x; m_0), p(x; m_1)$ for every $m_0, m_1$. Thus, the power of the region of
acceptance induced through the optimal observable vectors and then Corollary [1]
do not depend on the probability values calculated for each x, m.

As for (ii), what distinguishes the probability of detection (and false 
alarm) computed with the optimal observable vectors from those computed 
with the region of acceptance given as a result of the PRP (i.e., the ML ratio 
test) is the region of acceptance. We prove that the region of acceptance 
given through the optimal observable vectors is always more effective than 
the region of acceptance given as a result of the PRP, that is, independently 
of the probability distribution of the observable. 

Suppose that $p(x; m_j)$, $j = 0, 1$ are two arbitrary probability
distributions indexed by the parameters $m_0, m_1$, the latter indicating the
probability distribution of term frequency in non-relevant documents and in
relevant documents, respectively.

Theorem 2 For every $p(x; m_j)$, $j = 0, 1$ and $x_\alpha$

$$Q_c \ge P_c \qquad Q_e \le P_e \tag{29}$$



Proof Consider Figures 3(a) and 3(b). A probability of detection $p_d$ and a
probability of false alarm $p_0$ define the coordinates of $|m_0\rangle$ and $|m_1\rangle$ with a
given orthonormal basis $|e_0\rangle, |e_1\rangle$ (that is, an observable):

\begin{align}
|m_0\rangle &= \sqrt{1 - p_d}\,|e_0\rangle + \sqrt{p_d}\,|e_1\rangle \tag{30} \\
|m_1\rangle &= \sqrt{p_0}\,|e_0\rangle + \sqrt{1 - p_0}\,|e_1\rangle \tag{31}
\end{align}

The coordinates are expressed in terms of angles:

$$1 - p_d = \sin^2 \eta_1 \qquad p_0 = \sin^2 \eta_0 \tag{32}$$

provided that $\eta_i$ is the angle between $|m_i\rangle$ and $|e_i\rangle$.

The probability of error is

$$p_e = \xi p_0 + (1 - \xi)(1 - p_d) = \xi \sin^2 \eta_0 + (1 - \xi)\sin^2 \eta_1 \tag{33}$$

The probability of error is minimum when $\eta_0 = \eta_1 = \theta$ as shown in
[13, page 99].

But $\theta$ is exactly the angle between $|m_i\rangle$, $i = 0, 1$ and $|\mu_i\rangle$, and is defined
as a result of Equation (18). The probability of error is then minimized when
the observable vectors are the $|\mu_i\rangle$, $i = 0, 1$.

Therefore, $Q_e \le P_e$ for all $P_e$, that is, for all the observable vectors. As
$Q_c = 1 - Q_e$, the probability of correct decision is also maximum. ■
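The minimization step of the proof can be illustrated numerically for equiprobable states. The sketch below fixes a hypothetical angle between the state vectors, distributes the remaining angle between the two observable vectors, and searches a grid for the split that minimizes the probability of error written in terms of angles.

```python
import math


def error_from_angles(xi, eta0, eta1):
    """Equation (33): P_e = xi sin^2(eta0) + (1 - xi) sin^2(eta1)."""
    return xi * math.sin(eta0) ** 2 + (1.0 - xi) * math.sin(eta1) ** 2


gamma = math.radians(60.0)       # hypothetical angle between the state vectors
budget = math.pi / 2 - gamma     # eta0 + eta1, fixed by orthogonality
steps = 1000
candidates = [budget * i / steps for i in range(steps + 1)]

# Equiprobable states (xi = 0.5); eta1 is whatever is left of the budget.
best_eta0 = min(candidates, key=lambda e: error_from_angles(0.5, e, budget - e))
best_error = error_from_angles(0.5, best_eta0, budget - best_eta0)
print(math.degrees(best_eta0), best_error)
```

Under these assumptions the minimum falls at equal angles, and the minimal error coincides with $\frac{1}{2}(1 - \sin\gamma)$, which is the value of $Q_e$ for equiprobable states when $|X|^2 = \cos^2\gamma$.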



Suppose, for example, that X is a binary observable, that is, $x \in \{0, 1\}$.
The probability distributions are in Table [5]. Suppose that the size of the
test is fixed so that relevance is accepted when X = 1, with the corresponding
values of $P_d$ and $P_0$, and, if the states are equiprobable,
$P_e = \frac{1}{2}(P_0 + 1 - P_d)$ and $P_c = \frac{1}{2}(1 - P_0 + P_d)$.

The optimal observable vectors are

$$|\mu_1\rangle = \begin{pmatrix} 0.97 \\ -0.23 \end{pmatrix} \qquad |\mu_0\rangle = \begin{pmatrix} 0.23 \\ 0.97 \end{pmatrix} \tag{34}$$

These vectors can be computed in compliance with [15]. Hence, the region of
acceptance is the subspace spanned by $|\mu_1\rangle$ and, if the states are equiprobable,
$Q_e = \frac{1}{2}(Q_0 + 1 - Q_d) = 0.05$ and $Q_c = \frac{1}{2}(1 - Q_0 + Q_d) = 0.95$. Hence, if we
were able to find the optimal observable vector and to actually measure it,
retrieval performance would be much higher than the performance achieved
through the classical region of acceptance.



8 Experimental Study 

We have tested the theory illustrated in the previous sections through experi- 
ments based on the TIPSTER test collection, disks 4 and 5. The experiments 
aimed at measuring the difference between P e and Q e by means of a realistic 
test collection. To this end, we have used the TREC-6, 7, 8 topic sets. The
experimental algorithm is explained in Figure [5].

We have implemented the following test: $p(x; m)$ has been computed for
each topic word w and for each m by means of the usual relative frequency
$f_w(x; m) / \sum_x f_w(x; m)$ assuming that $f_w(x; m)$ is the frequency of w in the
documents with state m. Note that we do not aim at measuring effectiveness;
rather, we aim at measuring the difference between probabilities of error given
a document ranking.
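The relative-frequency estimate can be sketched as follows. We assume here, for illustration only, that the input is one term-frequency value per judged document in a given state; the function name and the sample data are hypothetical.

```python
from collections import Counter


def estimate_distribution(term_frequencies):
    """Relative frequency of each observed term frequency x.

    term_frequencies: one integer per document in a given state m, namely
    the number of occurrences of the word in that document.
    """
    counts = Counter(term_frequencies)
    total = sum(counts.values())
    return {x: counts[x] / total for x in sorted(counts)}


# Hypothetical frequencies of one topic word in judged documents.
relevant = [0, 1, 1, 2, 3, 1, 0, 2]
non_relevant = [0, 0, 0, 1, 0, 0, 2, 0, 0, 1]
p_rel = estimate_distribution(relevant)
p_nonrel = estimate_distribution(non_relevant)
print(p_rel, p_nonrel)
```

The two estimated distributions then play the roles of $p(x; m_1)$ and $p(x; m_0)$ in the computation of the probabilities of error.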

In the corresponding figure, $Q_e$ is never greater than $P_e$ for every size $\alpha$ and
for every prior probability $\xi$. The superposition of linear curves, one curve
for each $\alpha$, yields a polygonal curve like that of Figure [1].

Some linear curves are secant because they cut a bell-shaped curve in two
parts. However, they refer to different words: given a word, the linear curve
is never secant to the bell-shaped curve.

Figures [3, [HI M illustrate the plots for other three topics; these topics 
are representative of the main types of plot; the plots of all the 450 topics
exhibit a similar pattern.

9 Discussion 

The preceding example points out the issue of the measurement of the
optimal observable vectors. Measurement means the actual finding of the
presence/absence of the optimal observable vectors via an instrument or device.
The measurement of term frequency is straightforward because term
occurrence is a physical property measured through an instrument or device. (A
program that reads texts and writes frequencies is sufficient.) The
measurement of the optimal observable vectors is much more difficult because no
physical property corresponds to them and they cannot be expressed in terms
of term frequencies [11]. Thus, the question is: what should we observe from
a document so that the outcome corresponds to the optimal observable
vector? The question is not futile because the answer(s) would affect automatic
indexing and retrieval. In particular, if we were able to give an
interpretation to the optimal observable vectors, retrieval and indexing algorithms
could measure those vectors.

Following [3], three interpretations of the optimal observable vectors can 
be provided: 

• Geometrically, each vector is a superposition of two other independent
vectors. Figure [10] depicts the way the vectors interact and shows that
the observation of a binary feature places the observer upon either $|0\rangle$
or $|1\rangle$ whenever he measures 0 or 1, respectively. There is no way to
move upon $|\mu_0\rangle$ or $|\mu_1\rangle$ because $\mu_0, \mu_1$ cannot be measured.

• Probabilistically, the observable vectors and the state vectors are
related as a result of the vector probability rule (also known as the trace
rule because the general form of the equation is the trace of the product
of two matrices). As the observable vectors are mutually orthonormal by
definition, they induce a valid probability distribution.

• Logically, the observable vectors are assertions, e.g., X = x corresponds 
to \x). The basic difference between subspaces and subsets is that the 
vectors belong to a subspace if and only if they are spanned by a basis 
of the subspace. However, the logic to combine subspaces cannot be 
the set-based logic used to combine subsets. 

In classical probability theory, if we observe 1, we say that every document 
described by 1 is either relevant or not relevant, when relevance is measured. 
In general, we say that it either possesses a property or does not, when a 
property is measured. Hence, if an observable is described as sets of values 
(e.g., the set of documents indexed by a term), we can always describe rele- 
vance as a set. That is, the union of the set of relevant documents indexed 
by a term with the set of relevant documents not indexed by the term. 

The orthogonality between the observable vectors implements the mutual
exclusiveness between the observable values. Hence, if we observe 0,
we can say only that we do not observe 1, but cannot say anything about
μ_1, because |μ_1⟩ is oblique to |0⟩, |1⟩ and vector subspace complement,
union and intersection are not the same as subset complement, union and
intersection [3].
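
The difference between subspace logic and subset logic can be made concrete with a small sketch. It assumes one-dimensional subspaces (lines through the origin) of the real plane, represents each line by a spanning vector, and uses a hypothetical oblique vector `mu`; the helper names are illustrative. For subsets the distributive law always holds, whereas for these subspaces it fails.

```python
def parallel(u, v, tol=1e-12):
    """True when two nonzero plane vectors span the same line."""
    return abs(u[0] * v[1] - u[1] * v[0]) < tol

def meet_dim(u, v):
    """Dimension of the intersection of the lines spanned by u and v."""
    return 1 if parallel(u, v) else 0

def join_dim(u, v):
    """Dimension of the smallest subspace containing both lines."""
    return 1 if parallel(u, v) else 2

ket0, ket1 = (1.0, 0.0), (0.0, 1.0)
mu = (1.0, 1.0)  # hypothetical oblique vector; only its span matters

# Left side: span(mu) AND (span(|0>) OR span(|1>)).
# The join of |0> and |1> is the whole plane, so the meet is span(mu) itself.
lhs_dim = 1 if join_dim(ket0, ket1) == 2 else meet_dim(mu, ket0)

# Right side: (span(mu) AND span(|0>)) OR (span(mu) AND span(|1>)).
# Both meets are the zero subspace here, so their join has dimension zero.
rhs_dim = max(meet_dim(mu, ket0), meet_dim(mu, ket1))

# Subsets, by contrast, always satisfy the distributive law.
docs = {1, 2, 3, 4}
a, b = {1, 2}, {2, 3}
set_lhs = a & (b | (docs - b))
set_rhs = (a & b) | (a & (docs - b))
```

Here `lhs_dim` is 1 while `rhs_dim` is 0, whereas `set_lhs` and `set_rhs` coincide.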

At the present time, an IR system is capable of measuring observable
vectors like |0⟩, |1⟩, which correspond to term occurrence. The documents can
be ranked as specified by the PRP, thus achieving P_e, which is the current
lower bound provided that the probabilities are estimated as accurately as
possible [2].

Q_e and Q_c can be achieved if and only if an IR system is capable of
observing the optimal observable vectors (Theorem 1). If an IR system observed
|μ_0⟩ or |μ_1⟩ in a document, the system would decide whether the document is
relevant with probability of error Q_e.
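
A minimal numerical sketch may help to see how an error probability smaller than the classical one arises. It rests on assumptions that go beyond this section: a single binary feature, state vectors built from the square roots of hypothetical occurrence probabilities, the minimum-error (Bayes) criterion at a given prior rather than a fixed fallout, and the closed form for two pure states usually derived from the theory in [13]. The names `m_rel`, `m_non`, `xi` and all numbers are illustrative.

```python
import math

def state_vector(p_occurrence):
    """Unit vector (sqrt P(X=0), sqrt P(X=1)) for a binary feature."""
    return (math.sqrt(1.0 - p_occurrence), math.sqrt(p_occurrence))

def inner(u, v):
    """Inner product of two real vectors given as tuples."""
    return sum(a * b for a, b in zip(u, v))

def classical_error(xi, p_rel, p_non):
    """Bayes minimum error when only X=0 or X=1 can be observed (assumed criterion)."""
    outcomes = ((1.0 - p_rel, 1.0 - p_non), (p_rel, p_non))
    return sum(min(xi * r, (1.0 - xi) * n) for r, n in outcomes)

def optimal_observables(xi, m_rel, m_non):
    """Eigenvectors of xi|m_rel><m_rel| - (1-xi)|m_non><m_non| (2x2, real symmetric)."""
    a = xi * m_rel[0] ** 2 - (1.0 - xi) * m_non[0] ** 2
    d = xi * m_rel[1] ** 2 - (1.0 - xi) * m_non[1] ** 2
    b = xi * m_rel[0] * m_rel[1] - (1.0 - xi) * m_non[0] * m_non[1]
    theta = 0.5 * math.atan2(2.0 * b, a - d)      # angle that diagonalizes the matrix
    mu_rel = (math.cos(theta), math.sin(theta))   # larger eigenvalue: decide "relevant"
    mu_non = (-math.sin(theta), math.cos(theta))  # smaller eigenvalue: decide "not relevant"
    return mu_rel, mu_non

def vector_error(xi, m_rel, m_non):
    """Error probability when the eigenvectors above are the observed vectors."""
    mu_rel, mu_non = optimal_observables(xi, m_rel, m_non)
    miss = inner(mu_non, m_rel) ** 2          # relevant document judged not relevant
    false_alarm = inner(mu_rel, m_non) ** 2   # non-relevant document judged relevant
    return xi * miss + (1.0 - xi) * false_alarm

def closed_form_error(xi, m_rel, m_non):
    """Closed form for two pure states with priors xi and 1 - xi."""
    overlap_sq = inner(m_rel, m_non) ** 2
    return 0.5 * (1.0 - math.sqrt(1.0 - 4.0 * xi * (1.0 - xi) * overlap_sq))

xi = 0.3                    # hypothetical prior probability of relevance
p_rel, p_non = 0.6, 0.2     # hypothetical P(term occurs | relevant / not relevant)
m_rel, m_non = state_vector(p_rel), state_vector(p_non)

p_e = classical_error(xi, p_rel, p_non)
q_e = vector_error(xi, m_rel, m_non)
mu_rel, mu_non = optimal_observables(xi, m_rel, m_non)
```

In this toy setting the error computed from the eigenvectors agrees with the closed form and does not exceed the classical error. The sketch says nothing about how such vectors could be observed in a textual document, which is the open problem discussed next.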

The open problem is due to the difficulty of observing the optimal
observable vectors in a document: if a system is given a textual document as
input, how can it decide whether the document would provide either |μ_0⟩ or
|μ_1⟩ (or the corresponding eigenvalues) if the optimal observable vectors
were measured? We shall pay a great deal of attention to the question because,
if the problem were solved, the solution would be a significant breakthrough
in IR research.

10 Related Work 

Van Rijsbergen's book [3] is the point of departure of our work. Helstrom's
book [13] provides the theoretical foundation for the vector probability and
the optimal observable vectors. Eldar and Forney's paper [15] gives an
algorithm for the optimal observable vectors. Hughes' book [12] is an excellent
introduction to Quantum Theory. An introduction to Quantum Theory and
Information Retrieval can be found in [16]. In [17], the authors propose a
quantum formalism for modeling some IR tasks and information need aspects.
This paper does not limit the research to the application of an abstract
formalism, but exploits the formalism to illustrate how the optimal observable
vectors significantly improve effectiveness. In [18], the authors propose
|X|^2 for modifying the probability of relevance; |X|^2 is intended to model
quantum correlation (also known as interference) between relevance assessments.

11 Conclusions 

The research in IR has traditionally concentrated on extracting and
combining evidence as accurately as possible in the belief that the observed
features (e.g., term occurrence, word frequency) ultimately have to be scalars
or structured objects. The quest for reliable, effective, efficient retrieval
algorithms requires implementing the set of features as well as one can. The
implementation of a set of features is thus an "answer" to an implicit
"question", that is, which is the best set of features for achieving
effectiveness as high as possible? We suggest asking another "question" to
achieve an even better answer: Which is the best subspace?

(b) Optimal observable vectors are mutually orthogonal and symmetrically
placed around |m_0⟩, |m_1⟩

Figure 3: Geometry of decision and observable vectors 

Figure 4: Probabilities of error: P_e vs. Q_e. Figure Q] describes the polygonal
curve. The shortest bell-shaped curve corresponds to ξ = 0.50, whereas the
other bell-shaped curve corresponds to ξ = 0.90.



sort data by increasing f_w(x; m) and by m
for all topic t do
    extract title and description of t
    for all topic word w of t do
        compute p_m(w; x) for every x
        compute |X|^2
        for all size α ∈ {0.25, 0.50, 0.75} do
            for all prior ξ ∈ {0.01, ..., 0.99} do
                compute P_e, P_c
                compute Q_e, Q_c
            end for
        end for
    end for
end for

Figure 5: The experimental algorithm

(a) α = 0.25   (b) α = 0.50   (c) α = 0.75

Figure 6: P_e and Q_e plotted against ξ for each word of topic 439 and for each
α ∈ {0.25, 0.50, 0.75}; each curve corresponds to a word: + labels classical
probability of error curves, × labels vector probability of error curves.




(a) α = 0.25   (b) α = 0.50   (c) α = 0.75

Figure 7: P_e and Q_e plotted against ξ for each word of topic 301
(Figures 7(a), 7(b) and 7(c)) and for each α ∈ {0.25, 0.50, 0.75}; each curve
corresponds to a word: + labels classical probability of error curves, × labels
vector probability of error curves.




(a) α = 0.25   (b) α = 0.50   (c) α = 0.75

Figure 8: P_e and Q_e plotted against ξ for each word of topic 303
(Figures 8(a), 8(b) and 8(c)) and for each α ∈ {0.25, 0.50, 0.75}; each curve
corresponds to a word: + labels classical probability of error curves, × labels
vector probability of error curves.

(a) α = 0.25   (b) α = 0.50   (c) α = 0.75

Figure 9: P_e and Q_e plotted against ξ for each word of topic 426
(Figures 9(a), 9(b) and 9(c)) and for each α ∈ {0.25, 0.50, 0.75}; each curve
corresponds to a word: + labels classical probability of error curves, × labels
vector probability of error curves.




Figure 10: A geometric view of incompatible observable vectors 